The problem: You've just completed an amplicon sequencing run on the 454 instrument, but because the sequences are longer than any you've previously generated on 454, you suspect that you may have sequenced through your reverse primers and into non-biological sequence (e.g., sequencing adapters).
The solution: You want to find your reverse PCR primer in each of the sequences, and remove that and all bases following it. To do this, you can use scikit-bio's global nucleotide aligner.
import numpy as np
import random
from skbio.alignment import global_pairwise_align_nucleotide
from skbio import DNA
First, we'll model some sequences. This is quick-and-dirty. Each sequence will contain some biological sequence (which is what we actually care about) with a mean/std length of 400/40, followed by one of four slightly different reverse primers (so representing a primer with 4-fold degeneracy), followed by some non-biological sequence with a mean/std length of 25/2. This is a reasonable representation of what we'd get off of the sequencing instrument: our reverse primer is somewhere in the sequence, but we don't know the exact start or end positions. (Note that I'm not modeling any sort of sequencing error here, and the biological sequence is random, which is not representative of what we'd have in an amplicon sequencing run.)
Follow the inline comments for descriptions of each step.
sequences = []
num_sequences = 50
mean_biological_sequence_length = 400
std_biological_sequence_length = 40
mean_nonbiological_sequence_length = 25
std_nonbiological_sequence_length = 2
# imagine that we have four slightly different reverse primers
reverse_primers = [DNA("ACCGTCGACCGTTAGGATA"),
                   DNA("ACCGTGGACCGTGAGGATT"),
                   DNA("ACCGTCGACCGTTAGGATT"),
                   DNA("ACCGTGGACCGTGAGGATG")]
for i in range(num_sequences):
    # determine the length for the current biological sequence. if it's less than 1, make the length 0
    biological_sequence_length = int(np.random.normal(mean_biological_sequence_length,
                                                      std_biological_sequence_length))
    if biological_sequence_length < 1:
        biological_sequence_length = 0
    # generate a random sequence of that length
    biological_sequence = ''.join(np.random.choice(list('ACGT'), biological_sequence_length))
    # determine the length for the current non-biological sequence. if it's less than 1, make the length 0
    non_biological_sequence_length = int(np.random.normal(mean_nonbiological_sequence_length,
                                                          std_nonbiological_sequence_length))
    if non_biological_sequence_length < 1:
        non_biological_sequence_length = 0
    # generate a random sequence of that length
    non_biological_sequence = ''.join(np.random.choice(list('ACGT'), non_biological_sequence_length))
    # choose one of the four reverse primers at random
    reverse_primer = random.choice(reverse_primers)
    # construct the observed sequence as the biological sequence, followed by the primer, followed by the
    # non-biological sequence
    observed_sequence = ''.join(map(str, [biological_sequence, reverse_primer, non_biological_sequence]))
    seq_id = "seq%d" % i
    # append the result to the sequences list
    sequences.append(DNA(observed_sequence, metadata={'id': seq_id}))
print(repr(sequences[0]))
DNA
---------------------------------------------------------------------
Metadata:
    'id': 'seq0'
Stats:
    length: 428
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 47.66%
---------------------------------------------------------------------
0   CCTAGGAACG AACTTATATT AGCCATGAGG TAAGACGAAG GTCGAAACCC CCAATATATT
60  TCGCGGGTAA TCAATAAAAC CGCAAAGCAC CATAAGTTGG AGGTTGGAAC CCGCAGTCGG
...
360 GATGGCTCCA CTATAGAGTG TGAACAACCG TGGACCGTGA GGATGGGGGA CGCGGTCATC
420 CCGGAGCT
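Because the simulated reads are random, it can help to record where the primer truly starts in each read, so that the trimming step can be checked later. The following is a minimal standard-library sketch of the same simulation, not the notebook's code. The `simulate_read` helper, the fixed seed, and the use of plain strings in place of DNA objects are all assumptions made for illustration.

```python
import random

# The four reverse primers from the text, as plain strings.
PRIMERS = [
    "ACCGTCGACCGTTAGGATA",
    "ACCGTGGACCGTGAGGATT",
    "ACCGTCGACCGTTAGGATT",
    "ACCGTGGACCGTGAGGATG",
]


def simulate_read(rng, primers, bio_mean=400, bio_std=40, tail_mean=25, tail_std=2):
    """Return (read, primer_start) for one simulated sequencing read.

    Hypothetical helper: biological sequence, then one primer chosen at
    random, then non-biological sequence. Lengths are drawn from normal
    distributions and clamped at 0.
    """
    bio_len = max(0, int(rng.gauss(bio_mean, bio_std)))
    tail_len = max(0, int(rng.gauss(tail_mean, tail_std)))
    biological = "".join(rng.choice("ACGT") for _ in range(bio_len))
    tail = "".join(rng.choice("ACGT") for _ in range(tail_len))
    primer = rng.choice(primers)
    # The primer begins right after the biological sequence.
    return biological + primer + tail, bio_len


rng = random.Random(42)  # fixed seed so that the reads are reproducible
reads = [simulate_read(rng, PRIMERS) for _ in range(50)]
```

Each entry in `reads` pairs a read with its known primer start, which gives a ground truth to compare against the start index recovered by alignment.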
Now to get to the problem at hand: how do we find the primer sequence in a random sequence? The answer is with global alignment. If we align the first reverse primer to the first sequence, the first element of the returned tuple is a TabularMSA object.
Notice that in this step we get an EfficiencyWarning. That's because scikit-bio currently only has a Python implementation of global alignment, which is slow because it's a computationally complex algorithm. In the future, we'll have a C-based implementation which will be much faster.
aln = global_pairwise_align_nucleotide(reverse_primers[0], sequences[0])[0]
/home/evan/biocore/scikit-bio/skbio/alignment/_pairwise.py:599: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this). "to track progress on this).", EfficiencyWarning)
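To make the idea concrete, here is a small pure-Python sketch of the kind of dynamic programming that can locate a short primer inside a longer read when gaps at the ends of the alignment are free. It is not scikit-bio's implementation. The `locate_primer` name, the linear gap penalty, and the scoring values are assumptions chosen for illustration.

```python
def locate_primer(primer, read, match=1, mismatch=-2, gap=-3):
    """Return (start, end, score) of the best placement of primer in read.

    Read bases before `start` and after `end` are skipped at no cost,
    so the primer may match anywhere in the read.
    """
    n = len(read)
    # Row 0: nothing of the primer is aligned yet, so skipping read[:j]
    # costs nothing and an alignment beginning here would start at j.
    prev_score = [0] * (n + 1)
    prev_start = list(range(n + 1))
    for i in range(1, len(primer) + 1):
        # Column 0: primer bases placed before the read begins are gaps.
        cur_score = [prev_score[0] + gap] + [0] * n
        cur_start = [0] * (n + 1)
        for j in range(1, n + 1):
            s = match if primer[i - 1] == read[j - 1] else mismatch
            candidates = (
                (prev_score[j - 1] + s, prev_start[j - 1]),  # base against base
                (prev_score[j] + gap, prev_start[j]),        # primer base against a gap
                (cur_score[j - 1] + gap, cur_start[j - 1]),  # read base against a gap
            )
            # max() keeps the first best candidate, so ties favor base-against-base.
            cur_score[j], cur_start[j] = max(candidates, key=lambda c: c[0])
        prev_score, prev_start = cur_score, cur_start
    # The primer may end anywhere in the read: take the best cell of the last row.
    end = max(range(n + 1), key=lambda j: prev_score[j])
    return prev_start[end], end, prev_score[end]


primer = "ACCGTCGACCGTTAGGATA"
read = "GATTACAGATTACA" + primer + "TTGCA"
start, end, score = locate_primer(primer, read)
print(read[:start])  # the read with the primer and everything after it removed
```

The table has one cell per (primer base, read base) pair, so the cost grows with the product of the two lengths. That is part of why an interpreted implementation is slow on many reads.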
We next want to find the start position of the primer sequence in the sequencing product, which we can do using the gaps boolean vector of the first sequence in the alignment (stored below as gap_vector). The following tells us where the first non-gap character in the primer alignment is, which is the position in the sequencing product where the primer match begins.
gap_vector = aln[0].gaps()
primer_start_index = (~gap_vector).nonzero()[0][0]
print(primer_start_index)
386
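The reason this column index can be used directly as a position in the sequencing product is that every column of a pairwise alignment holds a base from at least one of the two sequences. In the columns before the primer's first base, the primer row is all gaps, so each of those columns must hold one base of the read. The toy example below illustrates this with plain strings. The aligned rows are made up for illustration, and treating both `-` and `.` as gap characters is an assumption.

```python
GAP_CHARS = "-."  # assumed gap characters


def first_non_gap_column(aligned_row):
    """Return the index of the first column that is not a gap."""
    for column, char in enumerate(aligned_row):
        if char not in GAP_CHARS:
            return column
    raise ValueError("row contains only gaps")


# A made-up pairwise alignment: the primer row is padded with gaps so
# that it lines up with its match inside the read row.
aligned_primer = "------ACCGTCGA---"
aligned_read = "GATTACACCGTCGATTT"

# Pure-Python equivalent of (~gap_vector).nonzero()[0][0]
gap_vector = [char in GAP_CHARS for char in aligned_primer]
primer_start_index = gap_vector.index(False)

# No gaps occur in the read before that column, so the column index is
# also an index into the unaligned read.
leading_read_columns = aligned_read[:primer_start_index]
print(primer_start_index, leading_read_columns)
```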
So, we can slice the original sequence up to (but not including) that position, and the result will be our sequencing product minus the reverse primer and the non-biological sequence.
print(sequences[0][:primer_start_index])
CCTAGGAACGAACTTATATTAGCCATGAGGTAAGACGAAGGTCGAAACCCCCAATATATTTCGCGGGTAATCAATAAAACCGCAAAGCACCATAAGTTGGAGGTTGGAACCCGCAGTCGGCATCTAGGGACAGTCCCTAAGTCTCTTGTCACGGTTTACCTGCCGCCATATAGTAAAACACAGAAAGCCAAGTAGTCAAAGCAAGTCCCAGTTGGGCAACGCAGTATTGTGTCTCGTGCTGGCATAACAGTGCATCGTTATGGAGAATGCCGATTATCTAGGCACTCGTACAATCAACTGAAGCGATCGTTTATCTTTCCTTAAATCCGTCCAAGATAGTGTCCAAGTAATACAATACTCGATGGCTCCACTATAGAGTGTGAACA
Finally, if we want to do this for all of the sequences, we can embed the above steps in a loop over the DNA sequences.
trimmed_sequences = []
for sequence in sequences:
    aln = global_pairwise_align_nucleotide(reverse_primers[0], sequence)[0]
    gap_vector = aln[0].gaps()
    primer_start_index = (~gap_vector).nonzero()[0][0]
    trimmed_sequences.append(DNA(sequence[:primer_start_index], metadata={'id': sequence.metadata['id']}))
/home/evan/biocore/scikit-bio/skbio/alignment/_pairwise.py:599: EfficiencyWarning: You're using skbio's python implementation of Needleman-Wunsch alignment. This is known to be very slow (e.g., thousands of times slower than a native C implementation). We'll be adding a faster version soon (see https://github.com/biocore/scikit-bio/issues/254 to track progress on this). "to track progress on this).", EfficiencyWarning)
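The loop above aligns only the first of the four primers to every read and relies on the alignment tolerating the few mismatches between primer variants. When the reads contain no sequencing errors, as in this simulation, a cheaper alternative is exact matching against all variants at once. The sketch below uses a regular expression rather than alignment, which is a different technique from the one in this recipe. The helper names are hypothetical. Note that the pattern built per column accepts every combination of the observed bases, which is more than the four listed primers.

```python
import re

primers = [
    "ACCGTCGACCGTTAGGATA",
    "ACCGTGGACCGTGAGGATT",
    "ACCGTCGACCGTTAGGATT",
    "ACCGTGGACCGTGAGGATG",
]


def primers_to_regex(primers):
    """Build a regex with one character class per variable primer position."""
    if len({len(p) for p in primers}) != 1:
        raise ValueError("primers must all have the same length")
    parts = []
    for column in zip(*primers):
        bases = sorted(set(column))
        # A fixed position is written as a literal base, and a variable
        # position as a character class such as [CG].
        parts.append(bases[0] if len(bases) == 1 else "[" + "".join(bases) + "]")
    return re.compile("".join(parts))


def trim_at_primer(read, pattern):
    """Return read up to the first primer match, or unchanged if none is found."""
    match = pattern.search(read)
    return read if match is None else read[:match.start()]


pattern = primers_to_regex(primers)
trimmed = [trim_at_primer("GATTACA" + p + "TTTT", pattern) for p in primers]
print(pattern.pattern)
```

Exact matching fails as soon as a read carries a substitution or indel inside the primer, so alignment remains the more robust choice for real sequencing data.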
We can then print the result, and we'll have achieved our goal.
import skbio
print("".join(skbio.io.write((s for s in trimmed_sequences), into=[], format='fasta')))
>seq0 CCTAGGAACGAACTTATATTAGCCATGAGGTAAGACGAAGGTCGAAACCCCCAATATATTTCGCGGGTAATCAATAAAACCGCAAAGCACCATAAGTTGGAGGTTGGAACCCGCAGTCGGCATCTAGGGACAGTCCCTAAGTCTCTTGTCACGGTTTACCTGCCGCCATATAGTAAAACACAGAAAGCCAAGTAGTCAAAGCAAGTCCCAGTTGGGCAACGCAGTATTGTGTCTCGTGCTGGCATAACAGTGCATCGTTATGGAGAATGCCGATTATCTAGGCACTCGTACAATCAACTGAAGCGATCGTTTATCTTTCCTTAAATCCGTCCAAGATAGTGTCCAAGTAATACAATACTCGATGGCTCCACTATAGAGTGTGAACA >seq1 GCAGAACTACCGAGACGGAGGTTTTCTCTAAGGTCGTCCGATATTGCGTAGGCAATACTATCCGAGAATCACGTAGAATGTCAATCGTCTTTGGTTTTACTAGCCCATTACAGGAGACCATGCGTGCATACACCCACGAGTCGATGTTGTAGTCACAGTGGTCGACACAGGACAGATACAGCATTATAGAAACGCTAAACGTGCGCCTCGGGTAAGTCTGGCGTGCGAATTTTCAAAGACCGGACTAAATAAGATTCTATTGAAAGAATCTTGGTGTTTATGTTTAGAAACGAAGGTCGGAGGTCCTGCAGGCAGCTCTGCGTGTGTGTATTCGGATGCACTCTCATCCGAAATCTTGTATATAACGTCAGTATCAGTGGTTGTTGGTCCTCTAATAGGTCCGATGG >seq2 TTTTGACAAACAAACATTAGGCCAATCGGTAACAATAAAACGGCAACGGCAAGGATACGTCGTACATGACCACTACCGGGCAAACGCACGAAACAACGCTAGAGCGTGGCACTCGCACGGTACGTGAGCCTCTACGAAGGATTCATCGCTCTAATCGTTCCGGTGTCACGCGTAGAATAACGACTCAACCTAGGCAGTGGGGGACCTTCGCTCTGAGCCGGTTGCCAGGTGGACGCGTTAGTGGGAGAAATACCGTCAATCATGCTGGGACACCGACTGTTGGACAGAGGAGGAGTCCGGGGCGATTGCACCAGGGCGGAGAAGTATATTCGCTTACATATGGCCAACCTCAGTTCAAACTTTTCTAAACCCAATCAGAACCATGTTCCACGCTTGCGCTTTGACACGAGAGCTGGGCCCAAATGACACACT >seq3 TGCTATTTATGCACGGCGTAGGGCTCTACGCTTGCGAGTGAGAATCTGCAAGAACATCGCTTAGACTTCCTCTCAGAAGAGTGCGTGCGATGGATCGCAGAGCGATCGGTATTGGAGAAAAGATAACTGAGGATCTCCGTTATTGGGGTGTCGCGACCGTGTAGACGACCCATGACTCCACCTGTGTCTACCGCCGACCAGCCCGAGATTCCGTACCGCAGCTGTATTACCTTGTCCACAAGCAACTTACGTTTTTGCTAGTTTCTTAACCTTGATGATACCTCGACTAGAATAGCCCCAAACTCTTTTCGAGAGCAAATCGAGACGCCTGTGTGGGTCACTGACACACGATGGGAAGGTTGGTATTGGCCACTCACCCACACAGACCCTTGGTCCCCAACTACACTGTTGATCCCCGTAGATGAGGGTGCAATGTTTATTAGATCGGTTAAGCCG >seq4 
GGTCCATTCCCCGCCCGCGTCTAAGCAGAGAGGTTTAGTGGACACACTGGCGCCTGAACCTTTGGTATACCTATGAGCAAGGGGTGAATACCAGTCCGCGGTTATCTCCGGCTGCAGTCTCGTTACGCCGAGTACTCTCCATTACTGAGCGACCTGATCATCCACCGTCAGCTAGATCCATCCTGGGGGTCGGGCCTGTAAATAAGGAGTTGCCTATTTTCTGTGCTTGGCGTAGATATTAAGTCTTTAAGTATGTTCTAGGTTTAACTCTCTCGGTCAATCCCTCCAACATGGAGCCACTAACACTTGCGTAGACACCTCATCCGGTTCGACCGCCGTGTTTCGACTTGGTCTGGCCCATACCTTACGTGAAACGCCTGTGCATTCCAGGGGCCCCCATTACG >seq5 TGTGGTGATTTGGCATCATAACGGCGTCAAATTGAGAGGCGCGGAAATCCCGACACGCTATTAAGCACTACCTTGGTCACCCCGACTGTTCTACTGGTTGGAGGGTACGCTTAATGACCCCTGACACCACCAGCTTAAGCGCGCGGTAATACGGCATGCCACCCCCGTCGATATCCCTCTCACGGTGACGCTCCATTTCCAGTGGTTAGCACTGTTTAAAGGTGGGTCCAACGGCAGCGGGGATTGACAGTAGGGTTCAGACTCAGCATCTTCCTAAGGGGCGCGAAAGGTAACATTGCCCGAGGGTTTTCCGCTCACATATG >seq6 GATACATTGGGTTTCGAATGCGAGAAAAGATTATCGGCCGGATGTCAGCCTGCGACTTAGAAAGTTGTGCAGGTACCGCTTGTCGGCTCCCTGTGTTTAACTTCACTCACTTATTTAATTCCCCTTAATGTGGCTCAGCGTATTGTCGAGTAACGTAGTATTTTACTCAAGAGCTTGGACACGCTTCATGCGGAGTCTACCTCTTGTTTCAAATGACGAGAGCACTAAAGATCCTGACAAGGGTAATATACGTACAAGCATGAGAAGGCGTCGTGACCACTCTTGTATTCACACGTTAGTCAAAGTTTGTAACGTGACTTAACCTCCTAAATACTCAGCCACAGGACTTAGACGGCAAATCCGGTCAGACAGTGAGCGCCTTCCCTTTGGGCGGAATCCCGAATCCTCTCCTTGCTGGCGCCC >seq7 TCCACAATAGACCGGCGGCCGTAGTACTTAAAGTTGCGGGAATACCTTTCCCGCCTTTCCCCTAGAGGGAGAGTAAGCTAACTGGGCACCTAATCATTGTACGTCTATCCTACTGCCTTAAAGTGGTGTATGACCGTGTGGTACCCAAAAGGTTGTAGGGCCGTTCTCCATCCAGACGAAGAAGATCTAATGGGGTCAAAAGACATGTTAACATTACTATGCACAACGCCGCGCCTTTGGCGTCTGGTACGGACGAAGTTATACTAACGCGTGGTCGCTCACCAAAAAAATTCCGTAACCCGGCACTGAATATTGCAGCAATAACGTTCACGTTATTAGAGCTAGAAACCCACCACACGTGCGCGGCGCTTGACGTGGTCGGCTGCACAGGTTATCTCTCCCGGCGCACTGGGCGACCTAGCGGCCC >seq8 
TCTCGGAGTCCACGTTGGAGGAGATTCTACCATAGAAGTCGTACGCCGTATGGCAACCTATGTGGAGCCAACATAAATAATCATTTGTACTTTGGCACCGTACCTTCCTCTTTTCTCCGAGGACAGAGGATACAATGATGCAGTGTACTTACGATTATCTTGGTAAACACACGAGGGATCTCCGTTAGACGTGTGACAAAGATCGAAGCAGGGTATGATATAATTTTTTAGCTCATATACATATTTCGGATATATACATTTGAGAGAGGCCCGTTCGCTCAGCCCCGGACACTCGAATGTCTACGGTGGCCTCGCAAGGTCTGCATCTTCCATCAGACCAGTTCACAAGATACTCTTACGCAGTAACTGTCCATATGCATAGGGCGATACCTTAAACCACCGTAGGGATCGCTCTGTTTTAGCGGCGTGTATCGACTTTCTCCCTTGGGGCTAAAAACCTG >seq9 TGTCAAGGTGTCGCTTACTCGTAAGTTCTTGGAGTCCCCAGCCGAGGACGCGTCATTCGGTAAGGGGTGTCGGTTACGTCCATTTGATCTCTATACTTGGCTATAACAGACTACCATCCGTCTGATAGGACTGTATCGGCAGTGAGGGACATTAAAGTGTCAAGGCTGTAATTGAACGTGTTGGCGTAAGTGATCCCAAGATGTACCTTTTTCAATGTGAAGCAACATTGTTAAGTCCGGCGGCGCAGTCATATATTCGAAACTCGTCATGATGGTGTCGCGCCGATTACCTACGTATCCGCAGACAAGGCGCCGTAGCTACCCCCGGAGCCCCCCGAGCTGCTTT >seq10 TAACCGTGTTTAGGCCAAGCTGCATTGCTATGAGTGGTACATATCGACTTTGAGGATAATGCTGTTTGGTGTGTCAGGAATGAGGCGGTTCTATCCCGGTAAGATAGTATATCGGTGCCGGTTGAGGCTAACAGGAAGGTGGGGTACCCTGATAACGCACAATGAGATGCGGGCAGCCGCAACACAAGCGTCGCCGGCGCAGGACCAGACTGTGATTGACATAATCGATAAAATGTGTGCTAACATAATGACCCTCTGAACCTATGCACGCGCTCTATTGTGGGTAAACGTGGATCCATGATACAGGCTGCCCCCCCTTATCAAGATCAACGGTGCATCCATAGATATCCGATTGGTCCGGTGACGTT >seq11 GCAAAGTAACCCGGATCAGAGCTTCCAATCACTCACGATCAACCTACCCCGGAAGTCGAGTCTTGCTTGTTTCAGCATGACGCCACGAATAGAGACGGTTTGGTATTAGCGTCAGGACGGAGAGAGAACGGAATAGCATCAATAAAGTTTTTGTATACATACATTGAATAATGTGCCGAGAAAAACTTGACATGCACAGCGCTGTGAGGCGGGGTCCAGGCGATCCCATGTGAGCTAGCAGTCCGCAGGAAGCTTTGATGTAGTTTCTCCCGCGTGTTCGGACCGTACTTCACCACGCAGTAGGCAATTGGTAGATCTGTAAACCCCAGCGTAAGGGGTAGCGCAGGATCCCAGCTGACCGGGAGTGCTATCTCAGGGTGTAGTGCATCAATGGCCAGCAGGCGTTAGTAATTGTAGCAGCACCCCCGAAGCTACA >seq12 
AGTAAACTTGAGAGCCGAAGTGGCTGACGGCTTAGTGGTATTTTCGCCCTTAACTGTTTACGGCAGTTTAGGCGGTCTTAGAGCTCCATAACGTCTTCTCCGCTCCCCGGATACCGGCTATAGTAACATGACACATCGATAGCCAAGTAGGCGGGAAATACCACTTCTTAGACGCATAATAGGGGTCAGCGCCAATTTCAGGCGGTCAGGACCCCTTCAACTAACGACGTTGCGTAACCCCGACTCTAGGGAGAGAGCCTAGTTTAGACATTATAGAGGAGAACATCACTCCGTGAAAATTAAGTTACACACTCCTCTGACAGGCGGCATTCTTCGGGAAAAGGTTCGTTCACCGTTGATGAGAGGCATCAAGTTCCACGCACGCTC >seq13 CATCGCTAGGGTCAGCGCCCCGTGGCGCGACCTGTAGAGCTTGTGGGGAGTATTGTCGCGAGTGGTAAGCGTGTAACTAGAAATTACGGTGCTATATACTCCACGAAGTTACAGAAAACTAATGCGGCGTGACGATTGTATTTGTCAACCCTCCTAGGAGAGTTGTGTATGGATTCTACGCGCCGTTCGTAGTCCGTTGCAATGCCTGCAGTGTTCCCGGTAAGCCTCGAACCGGCGTCCCGGGTAGCAACGCATCACGTTGAATGATCAGCGATTTTTAGACCGGAGCTGTTCGAGCGGTCCAGGCACCGTAGCGGCCCGTATGGCTGGTGCCTGTATGTTTCGTTCACGACTACGGAACATCGGAAAGGGCACCTT >seq14 GACGTATTAGATGGCTTTGCGAGTTCAAGAGCTTCGCCACTATGCATTTTGTCCAGTATTTGTCCCAGACACCCTTATGATCTACCCGCCACACGCTCGCTTGCGTTGTTCTGGTTCGCTTGTCTATGGCAAGCATAGCTTGCTTTAATGACATTCGACCGTAGTTTGGTACTCGTCTGGGAAGACTCAAGACTAGGCACACAGTGAAGGAGGAGGTAGACGCCCCGAGATCGTAGGATGGGTAGTGCGAGTATTCCTATGATCTACCTACTCGGTAAGGTGTGTACCTGGATCCACTCAGAAAACGACGGTCAAGTTCCAACATCGCTAGCTAAATATTCTCTGACACGCAGCTATTGATTTGGGAAGTGAACCACCTGGGGAAAGTGCATTAA >seq15 ATACATGGGCATGGTTCTAAGCCCCACGTCTACGCCCGGCAAATTTCGATGTTTTGCTGTAGTGATATACTGAGTTGGCTCAAAACACGAACATAGTAAATGCTGACGATATCGACATTATGTTTGGCTTTAAAAACTCTCGGAAATATCCCCGTGGGGGTCGACATAACACATATAATGTGCAGCCCCGACCCCGACGGTGACTCTCGATCGGCGCGTCCCGCTCTCTTGTTACAACGGTGGCCCCGATTGATTGTGAGACTCCTCATGCGCGCTACCATGCTAAAAATGCTCTCTCAATAACAATGATTCTCACGAGGATAGTACCCACCGATCCCGAGTTTTATTACTGTACTCAATTGCCTACGCCCTGTAGCTGGCGCTGGTGGAGCATTGTTGTATGAAAGCTAATAGTAAACGGTGA >seq16 
TCGGGGTGATCGGTTAAGAAGTTTACGCTTGTCATCCTCTGTGGAGGGGTATCTGCACACGATAACCGGTTATCTTAAAAGCATGAATATCCCACACACTGCCTGAGACGGTGAGACGCGTGGACGCGAGCCCCTCCAAATCCCAGGAGTACAGAGTTCCCGGGCGACGCATTGAGCGTCTCGGAGGGATACAGATATCGCGGCGTTCAAATGAGAGACGTAGCTGCACGAGAAGATCAAGGTTGCCCAAGTATTTGTCGGCGACTGGTACAAGAGTGCTCTTACGTGGCTTGAAGGCGCAAACCTAGTCCCGATGCGGCGTATCTCTTGGACTGAGGGGGAGATCAACCATCAAAGACTAGCCCCAGCAAACCGATTAGTTCTAATTAAGGGTATATAAAATAGAAATCATTCTCTCCACCTCAAACAATAGACGTATACAACAGACCCGACTACGGCGTC >seq17 CCGCGTACACGAGCCAGGTATTCCGAGACCACCTATATATAGTATTTACTGGGGCTAGAATATGGAGCATAGCCGGGACCAAGTTTGCAGTGACATTACAACGTAAGGACAACCTCAGATCCCTAGTGTTGAAAACCCTCCGTTCTGGTGCGTATTACTAACTCCTTTCCGTCCTCAGGGCGCCCCTGTGATCACCAATAGCCGAGTCTCTCACCCCGCTCTTAACTAATTTAAGTGAACGTCTTCGCTGATAGAGGGAATGAGCCCTGTATTAGCATGGGCATTTCTGTCATTTACACTTTTTTTGTAAGCGTGCGGAGGGTTAATTCGGTACACCGCAACGGCTAAGACGACATTAACCGGCCCGGCTGCC >seq18 ATTACGCAGTAGTCCCGGTAATATGTGGTCCTCTTGCAGTAGCGCCTTATTACAACAGCCGTCGATTACTGGCAATCTTAAGTGACTTAGAGGCCCCGTCTACGGGTCATCAATTGTCGGTCAGCTATCCAATGATAAAACACGGTCGATAACCGTATGGCGAAGCGGGCCAGCCCGTTTATGGTCAGGTTCCATATATTGGACAGTGTCCGCTCTCCCAGTGAGAGCGAGTTTTATGAGTTTACCACTTAGGCTAGAATTCGCGATAGCCCAAAGTCGTCGATGTCGTGTCCTTGCCCCAGCGGGACTTATACAACAGCGGTTCTCAAAAATTTAACATAGAGTCATTTAGCCCTAATCGGGGATGACCCCTGCGGTGGCACAAGCTCATGTTCTCGCAGAAGCAAATGAC >seq19 TGAATGTGTCCAACGCGATGAATGACGCAACCGCGAGGAGCGAAGTGAGCGGAGGTGTTCCCGGTGTTGCTGCTTGACAGCTCTAGCCATCTAGCTACGTGTAGCAAGCTTCGATTTCACGCGACCCGACCGTGAGAGGTTGGCTCACACCTTCCATGACATTGGATCTCGTTGGAGGTAGTCACGAAGGCCCACATCCGTTCGCGGGACTTAGTTAATGCATCGGGATTTATTGGCATATTGTACTTTTTTGACCCAAGGCCGCGACTTCAGGCCGCACGCCAAATCGCTACTGGCAGTACTATCGGCACAAGTGTTCGGACCCTTCGTCAATATTACACAAGTTATGAAGGAGGGGTTATTCTGGCACCGACCCGTGGTTTTACAATCT >seq20 
GGACCACATTATATAGATACCAGGATAGCCATGTCATACTCAGTTGTTTAGGGCTTGTAGCTTTGCTTCGAGGATGTATAAAAGAATGGGGCGATTGACCCAACATGTACATCAGAGAGGTACCCCACCACTTTACCATGCAGTTGTGTACCTCTTTTGCGGAGTCCCGCGGTACTAGTGTGTATTCGTTAACCGTTTATTTTGGCGAGCTTTGTTCGATTGGACCCCGTGTCCGAAGTCCGGGATTCCACGGACCGCATAGGTGTCAAGTAATCTACTAAATCGCCCGCCTTCCTACTACTGGACAAGCACTGGGATTCTAAGGAGTCCTTTTAGGTAATAATAGTCACTGTAACAGCAGAAGTTGTGTTAGGACGACTTAAGTAAGTGGTGGTTCTAGGTAAT >seq21 TTTCGGCGAACCCTCCTGTCTGCTTGCGATCGGCATCAACCATGAAAATAACTACTTCCTAAAGCGGTAGAGGTAGAACAGGTTGGATGAGCAGATTACTGATCCACATGATCAACGGCACTAGCGCCACATCGTTACGGAGAAGAGGCGTCCGCCGAGTCCATGTGCATAGTGTTTATTCCTCTGAGAGCCCGAACTGAAAGTAAAGCCTTACTGTTAGTTTAAACGTGAGGGTTCAGAAATTGGCCACTAAGTGACCCAACTGCATGCGCCCAGCTGCCCCGACCACTGCCCCGGTTTCTTTAAGACCCAAGGAGGAAGCTCCCTGCTGGTCATGTATTACTAAAGCGAGACATCTTAGCGATAAGTAGCGGAAATTAATGACA >seq22 AGTAGCGTACGTCATTAAAACTTTCAAAGTCCCAGTCAACGTGGCCGCCATCTGAGCCATAAGGCATACATACTGCGAGGCTCCAGTATGGTCACGAGTTTAAGCGCTCCCAAGCCCGATGGGCTTATGCTGATCCGTACATTGCGTACTCTCTATCAGCCGTACGATGGCAACGAGTTAGTGTTAGATGAATCCAGGGGCTGCCGTCAGACTGCCGTAAGCTTCCTTGCTTTGGTTGAACACTAGTGGTTTTCCTGAGTAACCTATATCAAGCTGAGGGAGTGTGCAATTATCGAGCAATACTTGAGGGCCTATGATGGACTTGTCCAAGATGTAGAGGGGGCTTCATCGTTAGGGTACAACGCAAGAGTA >seq23 AGGCGGAGTAGAACCGGTATAGAGCTCTAGAAGGTGTTACGACCCCTGAGGGGGTATGACGATAGGTAAGGATCAACTTGAGGAGCACCGTTATGATGCGGGGTATTAATAAAGTCAGTAAATGCATGGCTCGATCGATAGTGTATTTCCTTCTATGGATTTCCTCACTCTAGGACTAAAGGGCGGCAGTGCGTAAGGGACTAATCAGTTAGGCTGCGGGAGTAATTTGACCGCTTACCCGCTTCTTTCTTCGCCCACAACAGGGAAGCGCGGTTTGAACCTGAGATTCACTTCTCCGCACTTTCTAG >seq24 AACATCAGCTCCTCTTCTAAGCTGTTCCAACGGAATAAACGGAGTCTAACGTACGGAATGGTAAAGTCTTCGACGCGATAGTTGGATATATTGGCTTGGGGAAGTGACACGAAGGGATGAATCAATCGCCAATCACCCTACCTGGGTATTTACATTAACCGAGCGCTACAACTAAACACTGTGCACTCTCGCGGCACCAGGAATTGAGTCAAGCTTCGCACTGGCCTACCTCACAGGGGGAGGGTCATATTGTTCGGCGCAAGATACGGGGAATAGAGGCCTGACTGCGATGGAACAATTTGACTGGCTCGTATGCAGGCAATACACCACGAGATGAAACAACCCTGATCACCGTTTCCCCTCTGCGAAGCCAAGGGCCTATGCATTATCTGGCTGACATC >seq25 
GCGAGGAGAGCCCGGTTTGTAATCTGTCTACCAGTTGAAAGCCGGACAGCTAAAACTGCGTGCGGCCACTTTAGGCTCCTGGTCAGTGCGACGCGGATGGAGGCGGAATGGGGAGTGTTTACGGGCTCAACCCAGGAATGTCTCCCTAAGAAGATCGTTCTGTCACAGGATCTGTGTATACTCCCCACCGGTCTATATTAGCACGGCTCTATAGAATGAATCCGTCCCACAGAGTTCATAGTGGTCAAAGAGGAGCGACTCAGGTAGACACCCACGAATCATTTCCTGGAGATTTTACTTTGGAGACTGCTGATTTCAGCCACATGCTATCGTCTCAGGCGATCGGCCCCTACCATCTCAAGAGTTGCAGAGATCCTGTTCTCAGCCCTGTCTTGAACTGGGCCAAATTGACAACTCAGGGGAAACAGGTGGACCCCTGCTTCCCAGTGCAGACAATCTAGTCCAACTAACTTAGA >seq26 TCGCGGGTAGACGAGCTTTGGTTACAACCTCTAGAGTGGACCGTTCGGAGGGTATCGGGCCTACTCAGACAACTCGATGCTTTATGACATTGCACCAAGACTGGTTTTAGCGCCCGAGCATGTATACGTCACATCATGAGTTTGGCTACTAAGGTCGATAGACACAAGGTCTGAACAAACAAGCATAATATTTCCCCATCTTTTGCGAAGAAAAGGCCGGCTGGTACTTGTGGCTGGGTCCACATTGGCTAGCAATTAGATTGCATACAGCTTGTACACCCGAAAGGATCTTTTGGGATAGGACGCTGAAAGCGCTGACACCAGAAAATATCTTCAGCCACCATCAAACGCTTCGCGCTTGAGGCATTTTTATAGACGACTCTGGCCCCCCCTCTGTGACAGCTTCGTAAAATTATCCTCGCAGGTGGTGCCCCCACGTT >seq27 TACATGCGGGCTTATACTTTTGGCAGACCGCGTTGGCAGTAATCTAGGTCACATTTCAACGAGATAATAGTTAGCCGCCGCCTCGACGTTAACCTGCATTGTTAGTCAAACAAAACTGGGGTATTCGACTGGCCATCACATGTGACAGACATCCCAAACTCGTCATCACTCGCTTCCCCCCTAAGGACGGTTGATATAAGGTCATTGAAGGCAATACGGCGACGCCCACTCCAGCATTTCTACTGTGGCCGGCGACTGCTAGCCAGGTGTGCCGCTTCGGGAGTATCCCGTATAACTCGTCTAGGTATAACGATACATCGTTGTCCAATCCGGGTCATGGAGCAGCAATCTGATCAAGCCTTTTAACTCGCAGGGTGTATACTGGAGGCTCCTACCTTGACTTTGCATGATAGCGAC >seq28 GTTGACTTGGCCTTAGGGCGAGGTGTAAGCTAAGACCAAAAGCCAACATGAAGTTATGATGTCGTATCGTCTGGATCGGGGCTGACTATCCCCTCCCCTGTTGTCCGGAGCAATGCGTCTAAACGGAGAGGCGCGGAAGATCGAGCCAAACTATCTGCTTCAGGGCTAACATCGCTTATAGGTCGCTAGAAAGTTCGTAACTCAGATTTTGCGGCTATTGCAGTCTTTTTTCTTTACAACCACTCCTGCGCACATCAGGCGACTCCTGAGAGCCTCTCAGGAGATAGACGTTGATTGGTTTGAGTGAAGTATGTATCCCATCAACCCGGTCAGCCACGCGAACGAGAGACAGCTCCGCTCGAAGCGCATCTTATGAGCA >seq29 
AACCTTTTTTTGGGCCATACCTGGATAAAAGACTCGTGATTAGGAAAACGTTGAGCCCGATATTATTGGGGGCGCTTGGGAGATTAGTAGACTGGACTGAATCAAAAGACAGGCTGCTTTATAAGACACTATGCAAGGTCATCGGAGCTACCAACTACTTAAACAGCGGGCAGATATTGTTGCATAGGTGTTATAAAGCGCACGGCACCATTGTTGAAACTCCTATTCCCTTCCAGCAAGCCATCGGGATTCGAGTACCATTCACCTAGTCACAATTGCTGAGCTTGTACGGTGGCAGTATCAGGATAGTTCAGACAGACACTTATGAGTCGTTTAAGTAGCCTTCAGCGGATTCCATCTCGCGTG >seq30 TAATATTTGCAACCTTGCACAACGTAGCGCTCGGCCGAGAGTACTAATGCTTGCCACGTATATGATTGCCCTCATGGGTGGCCTGATAGGTTTAACATAACGCAGTAGGCGGACCACAAGCTGGAAAAGCCTTCACTCCATAGCATATATCCTCTTAGAGCGGCACATCAACGGAATACAAGCCAGTTTCCCAAAACATCTCTAGCACCTGGTGGCACTAGATGGGCTTGTGTGCGAGACTGTGACGTGAAACAGACTCCTGGCTGTCCAAACGTCAACCAGACCATAGCCAATATGCTGCGCCCGTCTCGGATCAATTTAGGCATTGCCTCACCACATTTTAATTCGATGGCGTTGTCACCGTAAACTCCGAT >seq31 CTTCGAGAAGTTTAGTCGAACAATTGTGATAATAAACCGACCCACGTACGAACGCAGCTAACCGTCCGGCATTCCAGCGATAGACGGAAGATTAAGTAGCTCCAACGCCGCCAAGTCTAGGTCTATTCGGGATATTCGGTGTGCCGAGATAGGAGTCCACACGGAACTTTAGCCAGGTTCTGACGGATATTGGCATTGGGGCAACGAGCGCATGGAAACCCCACCTTTTGTGGTACGCTGATACCTCAGCCCTACCACCATATTCTTTAATGTCTCATGTGCCGTGAAGTACGTCTACGTTACAGTGACTGACCCTTCCCCCTTGTAATGGATTGCGGGCTGGTTGCTCAAGGCTAATGCGATCCCCGCCGGGGAAGTGTGCCACTTCGTCATGCTTCAGGGCTACAGAAAGGAATTGACCTCTAG >seq32 TGGTAGTGGGTAAAAGGCACCTTTGAATCTGCTCATCCGAGCTAACCTCCACTAGGGCAGCAATGCAAGGGAGCTGATAACACGCCCCGTAAACCCCGTTACAGCATCATTGGACCACTATAAACCATTCGGTTTCATCTGTGTTTTATCCACGAAGCATGGCCATGTGTATAAACTGACGGATTGGCGTCCACTGTAGGGGCCGCATCGTACATAATCCATGCGTCCAGGGTGGAGACTCTGGTTGACCGTACAGGTCGTCTGTACTTTGGCCCACAGCCAACAGGAATACCTGAATCAATGTCTAGTGCTTGCCAATCATGTGATCGAGACGAGCCTAGAGCAGTTACACCTGCTTCGTAAAGGAGGCCTTCAAAGCTAAAGGTCGCTAAT >seq33 
ACGACCGTTCAAACTAGGTCGGATCCCGGCCGTAATCAGCTGTCCTGTATAGCGCGTTCTGACATAAATTATTCGTGATGTGCCAGTTGTCTGTGCCAAACCTAACCTTCCTTCTTTCGACCGTCGAGCACTCCACTTATCCTCTTAATTACGTAACAGACAGCAACTGCATACTAGATCTTCAATACATGTTTTGCGGGTACAGCCCGTCTGGCCCTGTTGCTCCGTGGAGGAAATTAATAATGGAACTCGTAAGTTACTCGCTAGTACCCATGCCTAACTTCGTTGGTTTAACTGTAGAGTCGTACTCCGGTGGAAGGTGGTGGTCAGAAGTTCGTCACGGGTCTTAAACACTGGCGTTTGAGCAGAATAGGCTCACCCTGTATCGTTAAAGATGGGTCGTCCTATCCCGTGGTTTGACCTCTCTTACCGCCTCC >seq34 CTCTAATGCCGGTTTAATTGCCGGTAGATACATGGAATGGGCGATTGAGTGTCAAGTTCCGCATCCGAGTAGTCGTAGGGCACTTTTCCCCAGGTTGTCAGTCTTGAAGTCACAAACTCAAATGAACAAGAATCACTCCGTTGTGAGTGATGATTATGTAGATTGTGGGATCACATCCAGGTGACAAAGCATCGCATATTAGTACACCTACGGTCCTTAGACTTATGGCATGCAGCCACGCAGACATTCACAGGGTGAATTGATGCTACACTATAGTCTAGCGTATTTGCTTTAGTCCCTGGCTTCCAGTATTGGTCCTGTCCACAGCTCTACTCTGCTACCGGGCCACTTGAATCAGCTGCGCCAAGCAATCGGGCCGGATAAACCTCGCCCAGAAACA >seq35 GAGTCCAAGATTGCTTGGCTCAGGGTGTACCATTGCAACTATACAGGCTGGACGCCGTGGGAAGGATAGAACATCATGCTTGCATGTCGGCGAATTTTTGGCCTGGATCTAAATTGGAAATATATCGATGAAGGTCTCCCTTACCTCCAGGCCGCGCCCAAAGTATATGCAGTTCGTCGTGTGAAACTGGTCAGTGGCTTCGATAGATAATCCTCGGGGCAATACGAAGAGGTGCAGAGCATTGAATAAGAGCGCAACTAGTCATTTGGCTTTCACAGGATGGAGCTGAAATCTT >seq36 CGTAAGCGTGCCTACGACTGTCGCATCAGAGTTTCAGACGACGTATGATCCTACCTCAATGGCGCATGAATGACGTGCAGACCGGGGCGCGTCACATGTCCTAACGAACACGCGATGTTTTAAGCTCACCTACCAGTTGGCGACTTGTCAAACCTAATGATGCTACCCGGCTAATGGCCCGCTATTTGACCTGGCGGCACAACGCTTGGTGGGTACTGAGTGGGGGAATTTGGTGACGCCAATTAATCGGTTATGGTTATTGGTTACCTAGAGCCCAACCCGCCTTGTAGATTAGGCGGCACGGACGACAGGGTAGCTCTACTTCTGGAAGATCCCCCAGCTTATACCGACCTCATCAATTGGCTGAGAGGAACTGGGGAACCGACTAGTTAACGAGACCACGCCTCGTCGGGGGTGCC >seq37 AAGGCGCTTTTCGATCGATAATTTCCAATTGGGTCGACATCTTGGATCGAGCGAAGTACATGGGCCGTTCTACGTATATGGCCATTGGCCTCTCGACTATGCCACCTTGCCATTTCCTCTACCTCCATAGGCGCTAGGTTTAATAATATAGTTAAATTCAGACTCCTTTTATGCTACCTTTAGTGAGGACCCGCCCAGAATAGGGAACACGCGACTGTGGACGCGTGAAGTGTGCTTTATTACGTCGTCCGGTGGTGAGTGTCGACAGATGCAAATATATGAGGTAAATAGCCGCTCACATCAACTGCTCCGGTTACGGGCCTAGGTTTGGGGGCGACAACGCTTCTTCTTGACCAACATAGGCTATCAC >seq38 
CCACCCGTTCGACCGTCTGTTGTTGAGACAAGGGGCGCGGGTAGGTACAGCGTGAGTTCGTAAGCGGCAACGCTTGCGTGAGCGCGTTTTGTGTCATTCTTGAAGCCACCGCAAGGTAGCTTCGCCAAATAAATGACAAGATCAGCGCTTTACTGGGGTGCTCGTAGCTTACTCCGGTAGACAGCTAACTTTCACCCGGCCTTGCATAGCCTTGATCTCGCTAAACTCGACCGGTCTTAAGACACGCGACTCAACAACATTCCATCCTGCAATCGTCGCGCTAGAAACCTAAGTTATAATTTATTAAGTGCATACCTGCATGTCGAGCTACGAGTTCTTTGCTCCTTGGAGGTGGAGATCGAACTGGCCGTTCGATACCTTATGTCGATCTCGGCTATCAAGTCAACCCC >seq39 CACGGTGGCGTTATTGATTACTGGGTCGGCGCGATTTACGGTTTGATAAGTCGGCTTTCGCTCGTGTTAGTCACTATATGGCCGGTACAACCGTGAGGTGCAGATTTATTTCACAGATCCGGTGCGATAAATGGTTTGTAAGGGTTCTCATTAGCACAGGGTAAAAGTAAAGGCGGTAGCCCCAAGGTGGCCCCCCAGACAGACTAGGATGATGTGTGCGACCCTGACAATTATGAACAGTGACAATCAGGCATCGTAATGCCCTTAGAGGCCTTGGTACCACTAGGCGTCCGGATCCCCAGTGGTTGACTAACAAACAAATAGTTAGGCTTAGAATTTGCGAGATTCCGTCCGACCTGGAGTATGCCATCTCTCAAGTCCACACTG >seq40 TGAAAGAACGTTTCTAGCCATTAAATGCACCAACAACCGCCCAAGATAACTCTATGAAACGGACCCGATTGTGAAACAATAACCCCCCCAATAGTGCGCTCTACTGAAAGACGGTTCCAGCCTAATTGACAGTGTACCCATGCCTAGCTGGCATTACCCTATGATTGATACGATCATAACCATACCGGTGTCCATCTTATAGACCGTAGACGATTTGGAGATCACCAGCGCAAAGAAGCATAACCTTTTTACTAAACGTGATTGCCAATCGCCAACTTATGTTGCCCGATGACAATAGAAGGCTGGCCTGTATGGCTTTTATACTGGCTCTTGTTCATTCGTGCTCCCGCCGAAGCTAGGTCACACTCTATGCTCCGCAAGGACAGAATCGCTAAA >seq41 TTTACGGATAGCCTAACCCAGTCGTCACCACATCGAGTTAGTAACCGGAAGCAAGCGACCTGACCAGACTTCCTCGGCTATGTCATATACATTAGTGCTAAGTCCCGTAGTCGCGGGTAAGTACTCCCCTCCACGTACCGATTACTTGGTTATCAGCTACTAATTCTCCGTCTCTTTGTATCAAATGGAATGTATCTCAGGAATTGGTTGACCCAGGCCATGCTAGCCCCCGGTGTTATTGGACTAGACTTGTTTCTGATATCCACGCTTCGGTAGCAGATGACTATAAACGCGCGGAGTCGGTGGCCTGGCACTGCTGGCG >seq42 TTTTTAATTTTGCAGTAAAGCTTCCCAAATCTGCGCGTCGCTTGATACATTGTCGGGACAAGCGCGTGCGCTGCGATGTCCTTTGCCTCCGTTGACCCGCGAACGTGGGATAATACGCGGTATTGCCCACCCGCTCGGCGGGAGGCCTACCGTTCTAGTTTGTGTACATATGGGTGTAAGCCCGCTTCGGCCCGCAGTAGTTTTTTCACCTGGCAAGCGATAACGCCCCCCATTCTCACCATATATAGAGTAACCGTTGAGAAGTTCAATTCTTTTCTCGTA >seq43 
TGCCCCGTAGGGCCCGGAAAATCACCACCATTGGGGCGAGAAGTCTGATACAATATGGCGACGAGATGGGATTCGTAAGGATACAAAGCTTCGACCTTCAATTCTACACGCTGTAACAGCGCCTCATTCCGGAGGTCTCTCGTTTTGATGCGCACGGCAGAATACTCAATAGGGCCCGTTTCTCACTGTTTGTAATAACCGCTCGCATCATCAGCGCTAACTCCTTGGTACGCAAGCCTGATCACTTTTTTTCTACCAGGCGGCTTTTTACTGCCGGATCTGGACCTCTCGCCGAGGTCTCCGCCGGCTAACTTAGCGGTACATACATAGACCTAGG >seq44 TCACCGAATTCACCTTGAAGCACTCTGTACATGCGTACATTTCGATTAGGTCGCAGCCCTACTCATACCGTTAACTAGCGGCATGCAGACAGCCTCGTCCTCTAAATCTGTAAACCTGGACCATATTCGAAAGGGCTCGACCTTCAAAAAAGTAAGGATTAACGACAGGCTCCTTAATTCCGAGTTACCTCCATCCACGAGCGTACGGGCGACCAGACATCCTAGGGGTTTAGCAAATCCTGACACCGAATTATCTGAGACCCCTAGTACGGAGCGAAAGTGCTCACCGAAAGCAAGCCAGGGTTTAGTGCCTTCTAGATCCCGCCGATTACCTCGGCCACGTACAGCACTGCCCGTCGTGCAGCCGCTTGAGGGTGATAGACTCCTATAGGGTGTTTA >seq45 CCTGATTAGCTTGTCTGGATGGGGCCCACCTCCAGAGTTCCTCCACTGGAACCAGCCTTCGAATACCGCTTTCTATTAATACGCCTAGGAAGCCGTAGATGGGGACCCTCCCCAACACGAAATAAGATTCAGGCATAGCTTTGGATACAGTCCCGTTTCGGTAGATGGGTTGACGGGCGGGTTAAGACGGCAAATACGTTGAATCTACTCTACGGTTAGATGGCTGGGTGGTAGCTTGTGTGACACTTAGAAAATGCAGAGATGCAAACTAGGAGTAATTCCCCGGATCCGTACAATCCTTGGGCATACAAGAGGAGAAAAACCTTCCGAATCCGGCATTCCGGTAGGACAGTCACGGCAATGCGGGTGCGGGACATGTGGTTAACCGGT >seq46 GTTCATCCGGGTCTTGCTGGAAAGCGCGCGGAACTTACGACGGACCGACGGCATGAAATCTTGGTTGTGCGGGGAACTGTGGCGTGTGTTAGCGGGCTAACGACAGGGTTAAACCAGATTTGACGCCTTGAGGGTAGAAGCGTGCTTTGTTGAAGATTAATGCGCTTGTCGGGTTTCGCGGTTTGGACGCTCGAAACCCTCCACGCATGCATTCTAAAGTGTATATCGAGGTCGGACTGAATGGCAACATGGTAGTTAATTTGTACGCCACTACCCAGGAGTCAGCTCCGAAACCAGTGCACCGCGCACGGTGGCTTAGTTGCACGCGAGGGGCCATGGACCGGCTCCTCTCTTGTACTATCCCACAGAAATCGCGGTGGTATACCCCTA >seq47 GATGTCGTACGTGCAGCTACTTTATCTGACCGCAATGGCGTAAGTCGGACGCTGAAGGGATGCCCCGTGTCCTGCTGCTTGATCAAAATTGTGGTAGGCTATTGTGATAAAGATTACCGTTTCTCTCTCCAAGCTTTAATCACGCAGTCCTTCAACGGACCCCCTGTTTGGTATACTATGACACAAGGCCTAATTTCCGGGACCAGTACGTGAGGAGACTAAACAGTTTGGCCTTCTTACGTCCAACCCGCTTGTAATCCTTCGCCCGAGACTTCGTTCCCTCTCTAGCTCTCACTACTGCGGATCTCATAACTATAACTAAGCCCATGGATAAACATACCGAAGCAAATATTGCTCATTCGCTATCCTGGATCCTGCTGCGTAGCGGGGATCGGATTCATGGCCTGGCTTCCATCCCTGTG >seq48 
AACCTGCGCACTTGATCTTCACTCTTACTCTAACTGCATTATCTTAATAAGGGTGACGCCAAGTTCTAGCGCAGAGGCAGCCGGGACTGCATTTAAAACCTGACATTTGGCATACATCTCTCGAGTTACAGTTCATGATCCTTTATAATGTTACGCGCGGTCAGTATGTGGCCTCCGCTACAAACGCTGATATACGGAAACACAGCTCCGTGGAGGCTGAGAGGGTAAGGCTTCCTCAGTTTTTTCCGGTTCAGTCACAGCTCCAGAGGACTACGCAACATCGTTCGGAAACTTGCAACCATTGGGGCATGGCGGCCTCAGGATTACAAGACCGTCCGAAGCGCCCAGATGCTAGCAGCGATACACGTTATTTAATTGGC >seq49 CCAGGAGGGACAAGGCAGCGTTATACCGGACAGTCGACACCAAGAAAGACCGCGACCTAGTTAAAGTCAGAGGTTATGATGTGGCGTGGAGTTGCCCTTGGCGTTGCCCGCATCAGAAGTCTGCGTGGGTAGTGCTCAGGCCACCCATCTCGTAGAAGAACCACGTGCTGACACGATCGGTCTGGGCGCCTCGCATTAGTTCAACGGACCGCTGGGTTTGAAAGTATGGACTGGACGACCGACTCACAGTGTAGCGGTAGTATCGGTCGCCGTACGGTATCCCAGCTAGTGGCGGATCTTGACGGACACTACAGCCTGTGTTGTTCACGTGTAATTTATTAGCGCTACCCATAAGAGATTGTGGAGCGAATTGTGATCACTGAG